## ── Attaching packages ────────────────────────────────────────────── fpp3 0.5 ──
## ✔ tibble      3.2.1     ✔ tsibble     1.1.4
## ✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.0     ✔ feasts      0.3.2
## ✔ lubridate   1.9.3     ✔ fable       0.3.3
## ✔ ggplot2     3.5.1     ✔ fabletools  0.4.2
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## ── Attaching packages ────────────────────────────────────────────── fpp2 2.5 ──
## ✔ forecast  8.20     ✔ expsmooth 2.3 
## ✔ fma       2.5
## 
## 
## Attaching package: 'fpp2'
## The following object is masked from 'package:fpp3':
## 
##     insurance
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:fma':
## 
##     pigs
## corrplot 0.92 loaded

Introduction

This assignment is to be submitted via email by the designated group representative and follows the Hyndman versions of the problems assigned in batch #1. The first batch consists of 12 problems, drawn from both the KJ (Kuhn and Johnson) and HA (Hyndman and Athanasopoulos) sources. The problems to be addressed are as follows:

The tasks outlined in these problems will require a thorough understanding of the materials covered in the specified chapters, ensuring a comprehensive grasp of the subject matter.

Exercise 2.3

  1. Download some monthly Australian retail data from OTexts.org/fpp2/extrafiles/retail.xlsx. These represent retail sales in various categories for different Australian states, and are stored in an MS Excel file.

    1. Read the data into R.

    2. Select one of the time series as follows (but replace the column name with your own chosen column):

    3. Explore your chosen retail time series using the following functions:

autoplot(), ggseasonplot(), ggsubseriesplot(), gglagplot(), ggAcf()
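The workflow for the three parts can be sketched as follows. This assumes the file has been downloaded to the working directory; "A3349873A" is a placeholder series ID to be replaced by your own chosen column.

```r
library(readxl)
library(fpp2)

# First row of the spreadsheet is a title row, so skip it
retaildata <- read_excel("retail.xlsx", skip = 1)

# Build a monthly ts object; the retail series start in April 1982
myts <- ts(retaildata[["A3349873A"]], frequency = 12, start = c(1982, 4))

autoplot(myts)        # trend and seasonality at a glance
ggseasonplot(myts)    # overlay years to compare seasonal patterns
ggsubseriesplot(myts) # one mini-series per month
gglagplot(myts)       # scatterplots of y_t against lagged values
ggAcf(myts)           # autocorrelation function
```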

Can you spot any seasonality, cyclicity and trend? What do you learn about the series?

The autoplot shows strong seasonality in the data, as well as an upward trend. Though there is a brief dip around 1990-2000, there is no evidence yet that this is part of a cycle.

The seasonal plot emphasizes the seasonality of the data. Sales start to rise in the fall before spiking sharply between November and December, then fall off after January, coinciding with holiday shopping and post-Christmas sales.

The subseries plot again highlights the seasonality of the data, but more clearly than the seasonal plot. Though sales rise from September onward, the floor remains roughly the same; December is the only month with both a higher ceiling and a higher floor.

The data are hard to read in the lag plots. We can see some negative and some positive relationships, but the number of panels, combined with the monthly frequency, makes it difficult to discern much.

The slow decrease in the ACF as the lag increases reflects the trend, while the scalloped shape shows the seasonality of the sales data.


Exercise 7.1

  1. Consider the pigs series, the number of pigs slaughtered in Victoria each month.
    1. Use the ses() function in R to find the optimal values of \(\alpha\) and \(l_0\), and generate forecasts for the next four months.
##      time year quarter gilts profit s_per_herdsz production herdsz
## 1 1967.00 1967       1   105  8.075        10.80       2645    703
## 2 1967.25 1967       2   119  7.819         9.16       2540    722
## 3 1967.50 1967       3   119  7.366         9.38       2565    738
## 4 1967.75 1967       4   109  8.113        10.39       2776    747
## 5 1968.00 1968       1   117  7.380         9.44       2725    755
## 6 1968.25 1968       2   135  7.134         8.69       2623    780
## 
## Forecast method: Simple exponential smoothing
## 
## Model Information:
## Simple exponential smoothing 
## 
## Call:
##  ses(y = production_ts, h = 4) 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
## 
##   Initial states:
##     l = 2645.2967 
## 
##   sigma:  144.1414
## 
##      AIC     AICc      BIC 
## 666.9711 667.5165 672.5847 
## 
## Error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 12.43225 141.1065 110.1921 0.3188315 3.565448 0.3549497
##                      ACF1
## Training set -0.003580523
## 
## Forecasts:
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Jan 1971       3241.984 3057.259 3426.709 2959.472 3524.496
## Feb 1971       3241.984 2980.757 3503.211 2842.472 3641.496
## Mar 1971       3241.984 2922.053 3561.915 2752.692 3731.276
## Apr 1971       3241.984 2872.563 3611.405 2677.003 3806.965

From the output, we observe \(\alpha\) to be 0.9999 and \(\sigma\) to be 144.1414. With \(\alpha\) this close to 1, the smoothed level tracks the data almost exactly, so the SES forecast is essentially the last observed value.

    2. Compute a 95% prediction interval for the first forecast using \(\hat{y} \pm 1.96s\), where \(s\) is the standard deviation of the residuals. Compare your interval with the interval produced by R.
## Manually computed 95% prediction interval:
## Lower Bound: 2959.47
## Upper Bound: 3524.5

The manually computed interval (2959.47, 3524.50) matches the 95% interval produced by R (2959.472, 3524.496) almost exactly, since the residual standard deviation here is essentially the model's \(\sigma\) of 144.1414. This agreement is a useful verification step, confirming that the SES model's prediction interval for the first forecast is consistent with the \(\hat{y} \pm 1.96s\) formula.
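A sketch of the manual computation, assuming `fit` holds the ses() object fitted above (e.g. `fit <- ses(production_ts, h = 4)`):

```r
# Manual 95% prediction interval for the first forecast (sketch)
point_fc <- fit$mean[1]            # first point forecast (3241.984 above)
s <- sd(residuals(fit))            # standard deviation of the residuals
c(lower = point_fc - 1.96 * s,
  upper = point_fc + 1.96 * s)     # compare with R's Lo 95 / Hi 95 columns
```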

Exercise 7.2

  1. Write your own function to implement simple exponential smoothing. The function should take arguments y (the time series), alpha (the smoothing parameter alpha) and level (the initial level \(l_0\)). It should return the forecast of the next observation in the series. Does it give the same forecast as ses?
## [1] 2724.992

Based on the forecasts above, the custom SES function's forecast (2724.992) does not match the forecast (3241.984) produced by R's ses() function for January 1971. Since \(\alpha = 0.9999\) makes the SES forecast nearly equal to the last observation of the series, the mismatch points to a bug in the custom implementation, such as an incorrect level update or starting level.
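For reference, a minimal SES implementation that reproduces ses()'s behaviour might look like this (a sketch; `my_ses` is a hypothetical name):

```r
# Minimal simple exponential smoothing (sketch). After processing the whole
# series, the current level l is the forecast of the next observation.
my_ses <- function(y, alpha, level) {
  l <- level
  for (t in seq_along(y)) {
    l <- alpha * y[t] + (1 - alpha) * l  # update the level with each observation
  }
  l
}
```

Calling `my_ses(production_ts, alpha = 0.9999, level = 2645.2967)` should return a value almost identical to ses()'s 3241.984, because with \(\alpha \approx 1\) the level converges to the last observation.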

Exercise 7.3

Modify your function from the previous exercise to return the sum of squared errors rather than the forecast of the next observation. Then use the optim() function to find the optimal values of \(\alpha\) and \(l_0\). Do you get the same values as the ses() function?

## 
## Forecast method: Simple exponential smoothing
## 
## Model Information:
## Simple exponential smoothing 
## 
## Call:
##  ses(y = production_ts) 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
## 
##   Initial states:
##     l = 2645.2967 
## 
##   sigma:  144.1414
## 
##      AIC     AICc      BIC 
## 666.9711 667.5165 672.5847 
## 
## Error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 12.43225 141.1065 110.1921 0.3188315 3.565448 0.3549497
##                      ACF1
## Training set -0.003580523
## 
## Forecasts:
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Jan 1971       3241.984 3057.259 3426.709 2959.472 3524.496
## Feb 1971       3241.984 2980.757 3503.211 2842.472 3641.496
## Mar 1971       3241.984 2922.053 3561.915 2752.692 3731.276
## Apr 1971       3241.984 2872.563 3611.405 2677.003 3806.965
## May 1971       3241.984 2828.961 3655.007 2610.320 3873.648
## Jun 1971       3241.984 2789.541 3694.427 2550.033 3933.935
## Jul 1971       3241.984 2753.291 3730.677 2494.593 3989.375
## Aug 1971       3241.984 2719.551 3764.417 2442.991 4040.977
## Sep 1971       3241.984 2687.860 3796.108 2394.525 4089.443
## Oct 1971       3241.984 2657.887 3826.081 2348.685 4135.283
## [1] "Optimal alpha (optim()): 0.813489953565518"
## [1] "Optimal l0 (optim()): 3149.70117501139"

The values obtained from optim() for \(\alpha\) (0.813) and \(l_0\) (3149.70) are not the same as those from the ses() function (\(\alpha = 0.9999\), \(l_0 = 2645.2967\)). This discrepancy suggests either that optim() converged to a local minimum of the sum of squared errors, or that the SSE function being minimised differs from the one used internally by ses(), for example in how the first forecast error is handled.
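One way to set this up (a sketch; `ses_sse` is a hypothetical name, and `y` is assumed to hold the numeric values of the series from Exercise 7.1):

```r
y <- as.numeric(production_ts)  # assumed series from Exercise 7.1

# SES objective: sum of squared one-step-ahead errors (sketch)
ses_sse <- function(par, y) {
  alpha <- par[1]
  l     <- par[2]                       # initial level l0
  sse   <- 0
  for (t in seq_along(y)) {
    e   <- y[t] - l                     # one-step-ahead forecast error
    sse <- sse + e^2
    l   <- alpha * y[t] + (1 - alpha) * l
  }
  sse
}

# Minimise over (alpha, l0), starting from alpha = 0.5 and the first observation
opt <- optim(par = c(0.5, y[1]), fn = ses_sse, y = y)
opt$par
```

Nelder-Mead (optim()'s default) can stop at a local minimum, so trying several starting values, or constraining \(0 < \alpha < 1\) with method = "L-BFGS-B", may bring the estimates closer to those of ses().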


Exercise 8.8

Consider austa, the total international visitors to Australia (in millions) for the period 1980-2015.

Step 1: Use auto.arima() to Find an Appropriate ARIMA Model

## Series: austa 
## ARIMA(0,1,1) with drift 
## 
## Coefficients:
##          ma1   drift
##       0.3006  0.1735
## s.e.  0.1647  0.0390
## 
## sigma^2 = 0.03376:  log likelihood = 10.62
## AIC=-15.24   AICc=-14.46   BIC=-10.57

Step 2: Check That the Residuals Look Like White Noise

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1) with drift
## Q* = 2.297, df = 6, p-value = 0.8905
## 
## Model df: 1.   Total lags used: 7

Step 3: Plot Forecasts from the ARIMA Model for the Next 10 Periods

The ARIMA(0,1,1) model with drift provides a reasonable fit to the austa series: the Ljung-Box test shows no significant residual autocorrelation (p-value = 0.8905), and the information-criterion values are relatively low.
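The three steps can be sketched in a few lines, assuming the forecast package and the austa data from fpp2 are attached:

```r
fit <- auto.arima(austa)          # Step 1: selects ARIMA(0,1,1) with drift
checkresiduals(fit)               # Step 2: residual plots + Ljung-Box test
autoplot(forecast(fit, h = 10))   # Step 3: forecasts for the next 10 periods
```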

Exercise 3.2

  1. Why is a Box-Cox transformation unhelpful for the cangas data?

The transformation is unhelpful because the variance of cangas does not increase or decrease consistently with the level of the series: variability is small in the early years, largest in the middle of the series (roughly 1975-1990), and smaller again at the end. A Box-Cox transformation can only stabilise variance that changes monotonically with the level, so no single value of \(\lambda\) works here.
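A quick way to see this, assuming the cangas data from fpp2 is attached; the side-by-side plot shows that no \(\lambda\) removes the mid-series burst of variability:

```r
# Compare the raw series with a Box-Cox transformed version (sketch)
lambda <- BoxCox.lambda(cangas)   # automatic lambda selection
autoplot(cbind(Raw = cangas,
               Transformed = BoxCox(cangas, lambda)), facets = TRUE)
```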

Exercise 6.2

  1. The plastics data set consists of the monthly sales (in thousands) of product A for a plastics manufacturer for five years.

    1. Plot the time series of sales of product A. Can you identify seasonal fluctuations and/or a trend-cycle? The plot shows clear seasonal fluctuations that repeat at the same time each year, along with an upward trend-cycle.

    2. Use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.

    3. Do the results support the graphical interpretation from part a? Yes. The decomposed plot confirms the seasonal fluctuations and the trend-cycle.

    4. Compute and plot the seasonally adjusted data.

    5. Change one observation to be an outlier (e.g., add 500 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier? The outlier produces a visible spike in the seasonally adjusted series, because classical decomposition cannot absorb a one-off shock into the seasonal component.

    6. Does it make any difference if the outlier is near the end rather than in the middle of the time series? Yes, if more recent observations are weighted heavily. An outlier near the end distorts the values that matter most for forecasting, so predictability suffers more than when the outlier sits in the middle of the series.
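Parts b through e can be sketched as follows, assuming the plastics data from fpp2; the index 30 used for the outlier is an arbitrary mid-series choice:

```r
fit <- decompose(plastics, type = "multiplicative")  # part b: classical decomposition
autoplot(fit)                                        # trend-cycle and seasonal indices

autoplot(plastics / fit$seasonal)                    # part d: seasonally adjusted data

plastics2 <- plastics
plastics2[30] <- plastics2[30] + 500                 # part e: inject a mid-series outlier
fit2 <- decompose(plastics2, type = "multiplicative")
autoplot(plastics2 / fit2$seasonal)                  # the spike remains in the adjusted data
```

seasadj(fit) from the forecast package returns the same seasonally adjusted series as the manual division by the seasonal component.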

Exercise 8.2

  1. A classic example of a non-stationary series is the daily closing IBM stock price series (data set ibmclose). Use R to plot the daily closing prices for IBM stock and the ACF and PACF. Explain how each plot shows that the series is non-stationary and should be differenced.

  • The time plot shows clear trends throughout and no constant mean, indicating non-stationarity.

  • The ACF shows a slow, gradual decay from values near one, which is characteristic of a non-stationary series.

  • The PACF shows one significant spike at lag 1 and little partial autocorrelation afterwards, consistent with a random walk that should be differenced.
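The three plots described above can be produced as follows (a sketch, assuming ibmclose from fma is attached):

```r
autoplot(ibmclose)   # time plot: wandering level, no constant mean
ggAcf(ibmclose)      # ACF: large, slowly decaying autocorrelations
ggPacf(ibmclose)     # PACF: one dominant spike at lag 1
```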

Exercise 2.1

Question:

Use the help function to explore what the series gold, woolyrnq and gas represent.

  1. Use autoplot() to plot each of these in separate plots.
  2. What is the frequency of each series? Hint: apply the frequency() function.
  3. Use which.max() to spot the outlier in the gold series. Which observation was it?

Answer:

Help function

Let’s use the help function to explore the data series ‘gold’, ‘woolyrnq’ and ‘gas’

This data series represents the daily morning gold prices in US dollars from 1 January 1985 to 31 March 1989.

This data series tracks the Quarterly production of woollen yarn in Australia in tonnes from Mar 1965 to Sep 1994.

This data series tracks Australian monthly gas production from 1956 to 1995.

Data visualization

Here, we will use ‘autoplot’ to plot the time series separately
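A sketch of the plotting code, assuming the fpp2 data sets are attached:

```r
autoplot(gold) + ggtitle("Daily morning gold prices (USD)")
autoplot(woolyrnq) + ggtitle("Quarterly woollen yarn production (tonnes)")
autoplot(gas) + ggtitle("Monthly Australian gas production")
```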

Observation:

  • Gold Price: The price of gold generally rises day over day until approximately day 770, where it experiences a significant spike. After this peak, the price trends downward until about day 1000.

  • Woolen Yarn Production: The production of woolen yarn exhibits a seasonal pattern, characterized by regular fluctuations with many ups and downs.

  • Gas Production: Gas production shows a seasonal pattern and a noticeable upward trend, particularly after 1970.

Frequency of each series

## [1] 1
## [1] 4
## [1] 12

To summarize, here is the frequency of each series:

  • gold: daily data, but the series is stored with frequency 1, so no seasonal period is defined.

  • woolyrnq: quarterly data, so the frequency is 4.

  • gas: monthly data, so the frequency is 12.

Outlier detection for the Gold data

## [1] 770
## [1] 593.7

The outlier in the gold series was the spike that occurred on day 770, with the gold price reaching 593.7 US dollars.

Exercise 3.1

Question:

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

  2. Do there appear to be any outliers in the data? Are any predictors skewed?

  3. Are there any relevant transformations of one or more predictors that might improve the classification model?

Answer:

Load the data set

First, we need to load the data set using the library mlbench

## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...

This data frame has 214 observations of 10 variables. We can print the first 5 elements of the dataframe.

##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1

Data visualization

To visualize the distributions of predictor variables, we can use boxplot for numerical variables.

Overall Observations:

  • The RI, K, Si, and Ba predictors show a relatively narrow range across glass types.

  • Some elements, like Na and Mg, exhibit greater variability.

A correlation heatmap can also be useful to understand the relationships between the predictor variables.

There is a strong positive correlation between Ca and RI, while Si and RI, for instance, are negatively correlated.
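A sketch of the heatmap code, assuming the Glass data from mlbench is loaded and the corrplot package shown in the session output:

```r
library(mlbench)
data(Glass)
# Correlation matrix of the nine numeric predictors (the Type factor is excluded)
corrplot::corrplot(cor(Glass[, 1:9]), method = "circle")
```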

Outliers and Skewness

Based on the boxplots for the Glass dataset, outliers are observed in several predictor variables like RI, Mg, Si, and Ca for Type 2 glass.

Let’s check for skewness using histograms

The predictor Mg is skewed to the left, while K, Ba, and Fe are skewed to the right.

Classification Model

  • A logarithmic transformation, for instance, can compress the range of values and reduce the impact of outliers and right skew.

  • Similarly, a Box-Cox transformation selects an appropriate power transformation for each predictor to normalize its distribution.
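A sketch of both ideas, assuming the caret and e1071 packages are available; note that preProcess() applies Box-Cox only to strictly positive predictors, so zero-heavy columns such as Ba and Fe are left untransformed:

```r
library(caret)
library(e1071)

apply(Glass[, 1:9], 2, skewness)   # quantify skew per predictor

# Box-Cox (where applicable), then center and scale
pp <- preProcess(Glass[, 1:9], method = c("BoxCox", "center", "scale"))
transformed <- predict(pp, Glass[, 1:9])
```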

Exercise 8.1

Question:

Figure 8.31 shows the ACFs for 36 random numbers, 360 random numbers and 1,000 random numbers. Explain the differences among these figures. Do they all indicate that the data are white noise?

Figure 8.31: Left: ACF for a white noise series of 36 numbers. Middle: ACF for a white noise series of 360 numbers. Right: ACF for a white noise series of 1,000 numbers.

Why are the critical values at different distances from the mean of zero? Why are the autocorrelations different in each figure when they each refer to white noise?

Answer:

  1. The differences between these figures lie in the size of the spikes and the width of the critical-value bounds. As the number of random numbers increases (from 36 to 360 to 1,000), the spikes in the ACF plots become shorter and the bounds narrower. In every plot the autocorrelations stay within the bounds and show no systematic pattern, so all three are consistent with white noise.

  2. The critical values sit at \(\pm 1.96/\sqrt{T}\), where \(T\) is the length of the series, so their distance from zero shrinks as the sample size grows. The autocorrelations differ between figures because each is estimated from a different random sample; longer series give estimates that are closer to their true value of zero.
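The bounds can be computed directly; the code below evaluates \(\pm 1.96/\sqrt{T}\) for the three sample sizes.

```r
# Critical-value distance from zero for an ACF of a series of length T
crit <- function(T) 1.96 / sqrt(T)
round(sapply(c(36, 360, 1000), crit), 3)   # -> 0.327 0.103 0.062
```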

Exercise 8.6

Answer:

Generate the data for AR(1)

Produce a time plot

To illustrate how the plot changes as we change \(\phi_1\), let's simulate and plot two scenarios with different values of \(\phi_1\).

Higher values of \(\phi_1\) lead to stronger autocorrelation and smoother, more persistent patterns.
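The simulation behind these plots can be sketched as follows; this mirrors the AR(1) recursion in the exercise, and the seed is an arbitrary choice for reproducibility:

```r
set.seed(1)
y <- ts(numeric(100))
e <- rnorm(100)                   # errors with sigma^2 = 1
for (i in 2:100) {
  y[i] <- 0.6 * y[i - 1] + e[i]   # AR(1) with phi1 = 0.6
}
autoplot(y)

# Re-run with, e.g., 0.95 (smoother) or 0.05 (close to white noise)
# in place of 0.6 to see the effect of phi1
```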

Generate data for MA(1)

Plot the data for MA(1)

Changing \(\theta_1\) to -0.6

We observe that positive values of \(\theta_1\) create a smoother process with more persistence, while negative values make consecutive observations tend to alternate around the mean, producing a choppier series.

Generate and plot ARMA(1,1)

Generate and plot AR(2)
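A sketch for these last two parts; the ARMA(1,1) uses \(\phi_1 = 0.6\), \(\theta_1 = 0.6\) and the AR(2) uses \(\phi_1 = -0.8\), \(\phi_2 = 0.3\), as in the exercise:

```r
set.seed(1)
e <- rnorm(100)

y1 <- ts(numeric(100))   # ARMA(1,1)
for (i in 2:100) y1[i] <- 0.6 * y1[i - 1] + 0.6 * e[i - 1] + e[i]

y2 <- ts(numeric(100))   # AR(2): violates the stationarity conditions
for (i in 3:100) y2[i] <- -0.8 * y2[i - 1] + 0.3 * y2[i - 2] + e[i]

autoplot(y1)   # fluctuates around a stable mean
autoplot(y2)   # oscillations grow without bound
```

The AR(2) parameters fail the stationarity condition \(\phi_2 - \phi_1 < 1\), which is why its oscillations increase in amplitude rather than settling around a mean.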